From Linguistic Theory to Syntactic Analysis: Corpus-Oriented Grammar Development and Feature Forest Model
Abstract
The goal of this thesis is to establish a system for the automatic syntactic analysis of real-world text. Syntactic analysis in this thesis denotes the computation of in-depth syntactic structures that are grounded in syntactic theories such as Head-Driven Phrase Structure Grammar (HPSG). Since syntactic structures provide essential components for computing the meanings of natural language sentences, the establishment of syntactic analyzers is a starting point for intelligent natural language processing. Syntactic analyzers are in strong demand in natural language processing applications, including question answering, dialog systems, and text mining. To date, however, few syntactic analyzers can process naturally occurring sentences such as newswire texts. This task involves two significant obstacles. One is the scalability of a grammar to the analysis of real-world texts. Grammar theories that worked successfully in toy systems could not be applied to the analysis of real-world sentences. Despite intensive research on syntactic analysis, the development of wide-coverage grammars has remained almost impractical. This is due to the inherent difficulty of scaling up a grammar: as a grammar becomes larger, maintaining its consistency becomes more difficult. Modern syntactic theories, which are called lexicalized grammars, explain diverse syntactic structures through combinations of lexical entries, which express word-specific constraints, and linguistic principles, which represent generic syntactic regularities. However, grammar writers cannot simulate in their minds all possible combinations of lexical entries and linguistic principles. Notably, a large number of lexical entries are required to treat real-world sentences, and the consistent expansion of lexical entries creates a bottleneck in the scaling up of lexicalized grammars. The problem is further exacerbated by the complicated data structures that linguistic theories require to express in-depth syntactic regularities.
The first proposal of this thesis is a new methodology for the development of lexicalized grammars. The method is corpus-oriented, in the sense that the objective of grammar development is the construction of an annotated corpus, i.e., a treebank, rather than a lexicon. This methodology supports the inexpensive development of lexicalized grammars owing to the systematic control of grammar inconsistencies and the reuse of existing linguistic resources. First, grammar developers define linguistic principles that conform to a target syntactic theory, i.e., HPSG in our case. Next, existing linguistic resources, such as the Penn Treebank, are converted into an HPSG treebank. The major work of grammar developers is to maintain the conversion process with the help of consistency checking by principles. That is, because conflicts in a grammar are automatically detected as violations when principles are applied to the treebank, grammar writers can easily identify the sources of inconsistencies. Once a sufficiently large HPSG treebank is available, a lexicon is collected from the terminal nodes of the HPSG syntactic structures in the treebank. Lexicon collection is completely deterministic; that is, treebank construction theoretically subsumes lexicon development.

The other obstacle is the modeling of preference in natural language syntax. Since linguistic research on syntax has focused on structural regularity, the modeling of preference has received little attention. However, it is indispensable for automatic syntactic analysis because applications usually require disambiguated or ranked parse results. Since probabilistic models attained great success in CFG
Similar resources
Exemplar-Based Syntax: How to Get Productivity from Examples
Exemplar-based models of language propose that human language production and understanding operate with a store of concrete linguistic experiences rather than with abstract linguistic rules. While exemplar-based models are well acknowledged in areas like phonology and morphology, common wisdom has it that they are intrinsically flawed for syntax where infinite generative capacity is needed. This...
Cultural Influence on the Expression of Cathartic Conceptualization in English and Spanish: A Corpus-Based Analysis
This paper investigates the conceptualization of emotional release from a cognitive linguistics perspective (Cognitive Metaphor Theory). The metaphor weeping is a means of liberating contained emotions is grounded in universal embodied cognition and is reflected in linguistic expressions in English and Spanish. Lexicalization patterns which encapsulate this conceptualization i...
A DOP Model for Semantic Interpretation
In data-oriented language processing, an annotated language corpus is used as a stochastic grammar. The most probable analysis of a new sentence is constructed by combining fragments from the corpus in the most probable way. This approach has been successfully used for syntactic analysis, using corpora with syntactic annotations such as the Penn Treebank. If a corpus with semantically annotat...
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
A Study of "Khetab be Parvane ha" from the Perspective of Persian Grammar
Contemporary poetry can be divided into poetry written before and after the Islamic Revolution. Among post-revolutionary poetry, the Pishro or avant-garde poetry is the most important style. The best-known figure of this kind of poetry is Reza Baraheni (1935- ), who became the most influential poet of post-revolutionary poetry with the publication of his Khetab be Parvane ha and in...